# Distribution of PM2.5
summary(data_2002$Daily.Mean.PM2.5.Concentration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
hist(data_2002$Daily.Mean.PM2.5.Concentration,
     main = "PM2.5 Distribution - 2002",
     xlab = "Daily Mean PM2.5",
     col = "lightblue")
summary(data_2022$Daily.Mean.PM2.5.Concentration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.700 4.100 6.800 8.414 10.700 302.500
hist(data_2022$Daily.Mean.PM2.5.Concentration,
     main = "PM2.5 Distribution - 2022",
     xlab = "Daily Mean PM2.5",
     col = "lightgreen")
Basic Map of Monitoring Sites
Based on the maps, the distribution appears to have remained largely the same; however, PM2.5 measurements increased in the Central Valley and Bay Area from 2002 to 2022.
Though there are no missing data, some values seem implausible. Specifically, negative PM2.5 values are implausible, and there are 215 of them in the combined dataset. Broken down by year, none were reported in 2002; all 215 negative observations were reported in 2022.
2002
Missing values: 0
Implausible (negative) values: 0
2022
Missing values: 0
Implausible (negative) values: 215
Looking at the histogram, the season with the most negative values is winter. Summer also saw an uptick of negative values being reported. There were few negative values reported in the spring and autumn.
# Missing values
sum(is.na(combined_data$PM25))
[1] 0
# Negative values (not possible for PM2.5)
sum(combined_data$PM25 < 0)
[1] 215
# Very high values (above 1000 µg/m³)
sum(combined_data$PM25 > 1000)
[1] 0
# Missing values by year
tapply(is.na(combined_data$PM25), combined_data$Year, sum)
2002 2022
0 0
# Negative values by year
tapply(combined_data$PM25 < 0, combined_data$Year, sum)
2002 2022
0 215
# Very high values by year
tapply(combined_data$PM25 > 1000, combined_data$Year, sum)
2002 2022
0 0
# Total observations by year
table(combined_data$Year)
2002 2022
15976 59918
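To put the negative readings in context, the per-year counts can be turned into per-year shares. The following is a minimal sketch, assuming a toy data frame `toy` (hypothetical, with made-up values) that has the same `Year` and `PM25` columns as `combined_data`:

```r
# Toy stand-in for combined_data (hypothetical values, for illustration only)
toy <- data.frame(
  Year = c(2002, 2002, 2022, 2022, 2022, 2022),
  PM25 = c(12.0, 7.5, -1.2, 6.8, 10.7, -0.4)
)

# mean() of a logical vector is the proportion of TRUE values,
# so this gives the share of negative readings within each year
neg_share <- tapply(toy$PM25 < 0, toy$Year, mean)
round(100 * neg_share, 2)  # percent negative by year
```

Applied to the counts reported above, 215 of 59,918 observations in 2022 works out to about 0.36%.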
# Temporal changes
## Filter negative values
neg_2002 <- combined_data[combined_data$Year == 2002 & combined_data$PM25 < 0, ]
neg_2022 <- combined_data[combined_data$Year == 2022 & combined_data$PM25 < 0, ]

## Ensure Date is in Date format
neg_2002$Date <- as.Date(neg_2002$Date)
neg_2022$Date <- as.Date(neg_2022$Date)

# Histogram for 2002 (no negative values, so chart is blank)
ggplot(neg_2002, aes(x = Date)) +
  geom_histogram(binwidth = 7, fill = "darkred", color = "white") +
  labs(title = "Negative PM2.5 Values in 2002", x = "Date", y = "Count") +
  theme_minimal()
The histogram for 2002 is blank because there are no negative values in that year. All negative values were reported in 2022.
# Histogram for 2022
ggplot(neg_2022, aes(x = Date)) +
  geom_histogram(binwidth = 30, fill = "darkred", color = "white") +
  labs(title = "Negative PM2.5 Values in 2022", x = "Date", y = "Count") +
  theme_minimal()
Warning: Removed 101 rows containing non-finite outside the scale range
(`stat_bin()`).
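The seasonal pattern read off the 2022 histogram can also be tabulated directly. The sketch below is illustrative only: `neg_dates` is a hypothetical vector, and the season definition (December to February as winter, and so on) is an assumption rather than something the analysis states. The `NA` entry stands in for a date that failed to parse, which is one possible explanation for the rows removed in the warning above, though the output does not confirm it.

```r
# Hypothetical dates of negative readings; NA mimics a date that did not parse
neg_dates <- as.Date(c("2022-01-15", "2022-02-03", "2022-07-20", "2022-12-30", NA))

# Map calendar month to season (assumed definition: Dec-Feb = Winter, etc.)
month_num <- as.integer(format(neg_dates, "%m"))
season <- cut(
  month_num %% 12,              # December becomes 0 so it groups with Jan/Feb
  breaks = c(-1, 2, 5, 8, 11),
  labels = c("Winter", "Spring", "Summer", "Autumn")
)

# useNA = "ifany" keeps rows with missing dates visible instead of dropping them
season_counts <- table(season, useNA = "ifany")
season_counts
```

Keeping the `NA` category in the table makes it easy to see how many readings are excluded from the seasonal comparison.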
Exploring Spatial Data at Three Levels
Primary Question: Have daily concentrations of PM2.5 decreased in California over the 20 years spanning from 2002 to 2022?
Level 1: State Trends
Plots
Boxplot: PM2.5 Distribution in California: 2002 vs. 2022
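On the boxplot, the median and IQR of daily PM2.5 levels look similar across both years, likely because extreme values stretch the scale; the summary statistics show a lower median in 2022 (6.8 vs. 12.0 µg/m³). 2022 also shows a higher number of extreme outliers, with some values exceeding 200 µg/m³. This suggests that although typical air quality did not worsen, 2022 experienced more severe pollution spikes.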
Line plot: Monthly PM2.5 in California: 2002 vs. 2022
Mean PM2.5 levels in 2002 were consistently higher, with peaks in spring and winter months. In contrast, 2022 shows a flatter, lower trend across all months, indicating better air quality, although some pollution spikes remain, as noted in the boxplot above.
Histogram: PM2.5 Frequency Distribution in California
Both years show a concentration of PM2.5 values at the lower end, but 2022 is more tightly clustered below 20 µg/m³. The broader spread in 2002 indicates more frequent moderate-to-high pollution days.
Barplot: Valid PM2.5 Observations by Year
The number of valid observations in 2022 is significantly higher than in 2002, possibly due to more frequent monitoring and improved data collection methods over time.
Statistics
Basic summary
In 2002, the mean PM2.5 was 16.1 µg/m³ with a wider spread (SD = 13.9), while 2022 saw a substantial drop to 8.41 µg/m³ and a narrower distribution (SD = 7.64). Despite the improvement, 2022 had 215 negative values and a maximum reading of 302 µg/m³, which suggests occasional data collection errors or extreme pollution events.
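Per-year summaries of this kind can be produced in base R. This is a minimal sketch on a toy data frame `toy` (hypothetical values); it is not the code that generated the figures above, only an illustration of computing mean, SD, median, and IQR by year:

```r
# Toy stand-in for the statewide data (hypothetical values)
toy <- data.frame(
  Year = c(2002, 2002, 2002, 2022, 2022, 2022),
  PM25 = c(10, 20, 30, 4, 8, 12)
)

# One row per year with mean, SD, median, and IQR of daily PM2.5
stats_by_year <- do.call(rbind, lapply(split(toy$PM25, toy$Year), function(x) {
  data.frame(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}))
stats_by_year
```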
Extreme values
2002 had more frequent moderate-to-high pollution days, with 1,512 readings above 35 µg/m³ and 435 above 55 µg/m³. In contrast, 2022 had fewer moderate events but more extreme spikes, with 37 readings exceeding 100 µg/m³, compared to just 2 in 2002.
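Exceedance counts like these can be tabulated for several thresholds at once. The sketch below uses a toy data frame `toy` (hypothetical values) and the 35, 55, and 100 µg/m³ cut-offs mentioned above; it counts readings strictly above each threshold, which is an assumption about how the original counts were defined:

```r
# Toy stand-in for the statewide data (hypothetical values)
toy <- data.frame(
  Year = c(2002, 2002, 2002, 2022, 2022, 2022),
  PM25 = c(36, 60, 12, 8, 120, 40)
)
thresholds <- c(35, 55, 100)

# For each threshold, count readings strictly above it within each year
exceed <- sapply(thresholds, function(th) tapply(toy$PM25 > th, toy$Year, sum))
colnames(exceed) <- paste0(">", thresholds)
exceed  # rows are years, columns are thresholds
```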
Monthly averages
Monthly PM2.5 in 2002 ranged from 13.3 to 20.7 µg/m³, peaking in April and June. In 2022, levels were consistently lower across all months, averaging around 8 µg/m³. The table shows a clear contrast: every monthly average in 2002 is in the double digits, while every monthly average in 2022 is in the single digits.
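A table of monthly averages by year can be built with `aggregate()`. This is a sketch under assumptions: `toy` is a hypothetical data frame with `Date` and `PM25` columns, and the report's own table may have been produced differently (for example with dplyr).

```r
# Toy stand-in with a Date column (hypothetical values)
toy <- data.frame(
  Date = as.Date(c("2002-01-05", "2002-01-20", "2002-04-10",
                   "2022-01-07", "2022-04-02", "2022-04-15")),
  PM25 = c(16, 18, 21, 8, 7, 9)
)
toy$Year  <- as.integer(format(toy$Date, "%Y"))
toy$Month <- factor(month.abb[as.integer(format(toy$Date, "%m"))], levels = month.abb)

# Mean PM2.5 for every Year-Month combination present in the data
monthly_avg <- aggregate(PM25 ~ Year + Month, data = toy, FUN = mean)
monthly_avg[order(monthly_avg$Year, monthly_avg$Month), ]
```

Storing `Month` as a factor with `month.abb` levels keeps the rows in calendar order rather than alphabetical order.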
Observation counts by year
The number of valid PM2.5 observations in California nearly quadrupled, from 15,976 in 2002 to 59,703 in 2022. This significant rise is likely due to more frequent monitoring as well as improved data reporting.
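The counts in this report are mutually consistent, which a short arithmetic check makes explicit. All numbers below come from the outputs in this report; only the variable names are introduced here for illustration.

```r
obs_2002       <- 15976  # valid observations in 2002 (no negatives that year)
obs_2022_total <- 59918  # all 2022 observations, including negatives
neg_2022       <- 215    # negative readings in 2022

# Valid 2022 observations are the total minus the negative readings
obs_2022_valid <- obs_2022_total - neg_2022
obs_2022_valid           # 59703

# Growth in valid observations from 2002 to 2022
growth_ratio <- obs_2022_valid / obs_2002
round(growth_ratio, 2)   # 3.74
```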
Plots
# California only
ca_data <- combined_data

#### Boxplot of PM2.5 by year ----
ggplot(ca_data, aes(x = factor(Year), y = PM25)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "PM2.5 Distribution in California: 2002 vs. 2022",
       x = "Year", y = "Daily PM2.5") +
  theme_minimal()

#### Line plot of monthly averages ----
ca_data$Date <- as.Date(ca_data$Date)
ca_data$Month <- factor(format(ca_data$Date, "%m"),
                        levels = sprintf("%02d", 1:12),
                        labels = month.abb)
ggplot(ca_data, aes(x = Month, y = PM25, group = Year, color = factor(Year))) +
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  labs(title = "Monthly PM2.5 in California: 2002 vs 2022",
       x = "Month", y = "Mean PM2.5", color = "Year") +
  theme_minimal()

#### Histogram of PM2.5 Values ----
ggplot(ca_data, aes(x = PM25, fill = factor(Year))) +
  geom_histogram(binwidth = 2, alpha = 0.6, position = "identity") +
  labs(title = "PM2.5 Histogram in California",
       x = "Daily PM2.5", fill = "Year") +
  theme_minimal()

#### Obs count ----
ca_data %>%
  filter(PM25 >= 0) %>%
  count(Year) %>%
  ggplot(aes(x = factor(Year), y = n, fill = factor(Year))) +
  geom_col() +
  labs(title = "Valid PM2.5 Observations by Year", x = "Year", y = "Count") +
  theme_minimal()
# A tibble: 24 × 3
Year Month monthly_avg
<dbl> <fct> <dbl>
1 2002 Jan 17.5
2 2002 Feb 15.6
3 2002 Mar 17.0
4 2002 Apr 20.7
5 2002 May 16.4
6 2002 Jun 19.1
7 2002 Jul 18.7
8 2002 Aug 13.3
9 2002 Sep 15.4
10 2002 Oct 13.6
11 2002 Nov 14.2
12 2002 Dec 18.3
13 2022 Jan 8.36
14 2022 Feb 7.80
15 2022 Mar 7.70
16 2022 Apr 7.82
17 2022 May 8.15
18 2022 Jun 7.79
19 2022 Jul 7.91
20 2022 Aug 8.44
21 2022 Sep 8.65
22 2022 Oct 8.00
23 2022 Nov 8.15
24 2022 Dec 8.06
#### Obs counts by year
ca_data %>%
  filter(PM25 >= 0) %>%
  count(Year)
Year n
1 2002 15976
2 2022 59703
Level 2: County Trends
Plots
Faceted line plots: Monthly PM2.5 Trend by County: 2002 vs 2022
Most counties show a clear reduction in monthly PM2.5 levels from 2002 to 2022, with flatter and lower curves in 2022. Seasonal peaks in 2002, especially in spring and summer, are less pronounced or absent in 2022. This suggests that air quality generally improved across California.
Ridgeline plot: PM2.5 Distribution by County
Across nearly all counties, the 2022 distributions are narrower and shifted toward lower PM2.5 values compared to 2002. While some counties still show long tails or individual high values, the overall density indicates a major reduction in pollution.
Barplot: Observation Counts by County
Observation counts increased substantially in 2022 across nearly every county. Los Angeles led in both years, followed by other Southern California counties such as San Diego and Riverside. As mentioned above, data collection frequency likely improved, but those improvements appear to have been more concentrated in Southern California.
Statistics
Basic summary
Most counties experienced significant drops in mean PM2.5 levels from 2002 to 2022. Standard deviations (SD) generally declined, indicating fewer high-pollution days, though some counties like Trinity and Siskiyou had unusually high SD and maximum values in 2022.
Extreme values
The number of days exceeding moderate PM2.5 levels fell significantly in most counties by 2022, especially above 35 and 55 µg/m³. However, a few counties such as Placer, Mariposa, Trinity, and Siskiyou had multiple significant spikes above 100 µg/m³ in 2022, likely from uncommon pollution events.
Observation counts
Most counties had a significant increase in valid PM2.5 observations from 2002 to 2022, with some, like Los Angeles, Riverside, and Inyo, more than doubling their observations. This increase reflects broader geographic coverage, especially in regions that appear to be more populous. A few counties had no data in one of the two years (Glenn, Madera, and Tehama in 2002; Modoc in 2022), indicating coverage gaps that limit some county-level comparisons.
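Counties with no data in one of the years can be identified programmatically rather than by scanning the table. A minimal sketch, assuming a toy data frame `toy` (hypothetical rows) with one row per observation and `County` and `Year` columns:

```r
# Toy stand-in: one row per observation (hypothetical)
toy <- data.frame(
  County = c("Alameda", "Alameda", "Glenn", "Modoc", "Modoc"),
  Year   = c(2002, 2022, 2022, 2002, 2002)
)

# Cross-tabulate observations by county and year; a zero marks a year with no data
counts <- table(toy$County, toy$Year)
gaps <- rownames(counts)[rowSums(counts == 0) > 0]
gaps
```

Unlike a wide table built from counted rows, where a missing year shows up as `NA`, `table()` reports absent combinations as explicit zeros, which makes them easy to filter on.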
Plots
#### CA counties only
county_data <- ca_data %>%
  filter(!is.na(County))

#### Faceted line plot ----
combined_data$Date <- as.Date(combined_data$Date)
combined_data$Month <- as.numeric(format(combined_data$Date, "%m"))
ggplot(combined_data, aes(x = Month, y = PM25, group = Year, color = factor(Year))) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  facet_wrap(~County) +
  labs(title = "Monthly PM2.5 Trend by County: 2002 vs 2022",
       x = "Month", y = "Mean PM2.5", color = "Year") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        text = element_text(size = 14),
        legend.position = "bottom")
Warning: Removed 46003 rows containing non-finite outside the scale range
(`stat_summary()`).
Removed 46003 rows containing non-finite outside the scale range
(`stat_summary()`).
#### Ridgeline plot by county and year ----
combined_data %>%
  filter(PM25 >= 0, !is.na(County)) %>%
  ggplot(aes(x = PM25, y = County, fill = factor(Year))) +
  geom_density_ridges(alpha = 0.6) +
  labs(title = "PM2.5 Distribution by County",
       x = "Daily PM2.5", y = "County", fill = "Year") +
  theme_minimal()
Picking joint bandwidth of 1.56
#### Observation counts ----
combined_data %>%
  filter(PM25 >= 0, !is.na(County)) %>%
  count(County, Year) %>%
  ggplot(aes(x = n, y = reorder(County, n), fill = factor(Year))) +
  geom_col(position = "dodge") +
  labs(title = "Observation Counts by County",
       x = "Number of Observations", y = "County", fill = "Year") +
  theme_minimal()
# A tibble: 51 × 3
County `2002` `2022`
<chr> <int> <int>
1 Alameda 201 1790
2 Butte 473 1109
3 Calaveras 60 355
4 Colusa 95 401
5 Contra Costa 276 815
6 Del Norte 110 452
7 El Dorado 208 228
8 Fresno 760 2758
9 Glenn NA 335
10 Humboldt 59 116
11 Imperial 342 1624
12 Inyo 277 3186
13 Kern 1800 2315
14 Kings 83 721
15 Lake 61 61
16 Los Angeles 1879 5064
17 Madera NA 360
18 Marin 97 476
19 Mariposa 290 573
20 Mendocino 122 700
21 Merced 89 719
22 Modoc 2 NA
23 Mono 111 944
24 Monterey 120 1107
25 Nevada 226 724
26 Orange 470 878
27 Placer 60 1765
28 Plumas 177 1118
29 Riverside 1017 4528
30 Sacramento 819 2544
31 San Benito 119 477
32 San Bernardino 835 2715
33 San Diego 1350 4597
34 San Francisco 196 352
35 San Joaquin 124 1008
36 San Luis Obispo 168 1422
37 San Mateo 100 350
38 Santa Barbara 152 1242
39 Santa Clara 459 1182
40 Santa Cruz 61 695
41 Shasta 271 480
42 Siskiyou 104 424
43 Solano 97 720
44 Sonoma 93 346
45 Stanislaus 183 744
46 Sutter 114 712
47 Tehama NA 347
48 Trinity 90 400
49 Tulare 508 1212
50 Ventura 556 2138
51 Yolo 112 374
Level 3: LA County Trends
Plots
Boxplot: PM2.5 Distribution in LA County: 2002 vs 2022
PM2.5 levels in 2022 were lower and less variable than in 2002, with a smaller IQR and smaller range of extreme outliers. The drop in median PM2.5 suggests an improvement in air quality.
Line plot: Monthly Mean PM2.5 in LA County
Across all months, 2022 shows consistently lower PM2.5 levels than 2002. Levels in 2002 were consistently high, with the highest peak at ~25 µg/m³ during the summer months. This indicates reduced seasonal spikes (as seen in winter and summer in the earlier plots) and improved air quality in LA County.
Time series: Daily PM2.5 in LA County
Daily PM2.5 levels in 2002 were high, with frequent fluctuations and high peaks. In contrast, 2022 shows a flatter, more stable trend with few significant spikes, indicating fewer adverse events that worsen pollution.
Barplot: Valid PM2.5 Observations in LA County
Observation counts more than doubled from 2002 to 2022, indicating improved monitoring and data collection methods.
Statistics
Basic summary
Mean PM2.5 in LA County dropped from 20.8 µg/m³ in 2002 to 11.1 µg/m³ in 2022, with a narrower distribution and lower variability. The median also fell by over 8 µg/m³, indicating improved air quality in the county.
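The size of this drop can be expressed as a relative change, which makes it comparable with the statewide figures. The helper `pct_change()` below is a hypothetical name introduced for illustration; the inputs are the means reported in this analysis.

```r
# Relative change from an old value to a new value, in percent
pct_change <- function(old, new) 100 * (new - old) / old

round(pct_change(20.8, 11.1), 1)    # LA County mean, 2002 to 2022: -46.6
round(pct_change(16.12, 8.414), 1)  # statewide mean, 2002 to 2022: -47.8
```

On these figures, the relative decline in LA County is close to the statewide decline, even though LA County started from a higher level.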
Extreme values
No days exceeded 100 µg/m³ in either 2002 or 2022. However, 2002 had 95 days above 35 µg/m³ and 17 above 55 µg/m³, while 2022 had just 9 and 1, respectively.
Monthly averages
Monthly PM2.5 levels were consistently higher in 2002, peaking near 25 µg/m³ in July. In 2022, monthly averages stayed between 9 and 13 µg/m³, showing fewer seasonal spikes and steadier air quality levels. The table also indicates that, generally, PM2.5 levels decreased over time.
Observation counts
Valid observations nearly tripled from 744 in 2002 to 2,014 in 2022, indicating better monitoring and data collection methods for LA County.
Plots
#### LA County only
la_sites <- ca_data %>%
  filter(County == "Los Angeles")
la_data <- combined_data %>%
  filter(County == "Los Angeles", PM25 >= 0, !is.na(Date))

#### Boxplot of PM2.5 by Year ----
ggplot(la_data, aes(x = factor(Year), y = PM25)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "PM2.5 Distribution in LA County: 2002 vs 2022",
       x = "Year", y = "Daily PM2.5") +
  theme_minimal()

#### Monthly Trend Line ----
la_data$Month <- factor(format(as.Date(la_data$Date), "%m"),
                        levels = sprintf("%02d", 1:12),
                        labels = month.abb)
ggplot(la_data, aes(x = Month, y = PM25, group = Year, color = factor(Year))) +
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  labs(title = "Monthly PM2.5 in LA County: 2002 vs 2022",
       x = "Month", y = "Mean PM2.5", color = "Year") +
  theme_minimal()

#### Daily Time Series ----
ggplot(la_data, aes(x = as.Date(Date), y = PM25, color = factor(Year))) +
  geom_line(alpha = 0.4) +
  labs(title = "Daily PM2.5 in LA County",
       x = "Month (MM format)", y = "PM2.5", color = "Year") +
  theme_minimal()

#### Obs count ----
la_data %>%
  count(Year) %>%
  ggplot(aes(x = factor(Year), y = n, fill = factor(Year))) +
  geom_col() +
  labs(title = "Valid PM2.5 Observations in LA County", x = "Year", y = "Count") +
  theme_minimal()